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ABSTRACT 

We present tests of comparison between our versions of the Fast 
Multipole Algorithm (FMA) and "classic" tree-code to evaluate 
gravitational forces in particle systems. We have optimized the 
Greengard's original version of FMA allowing for a more efficient criterion 
of well-separation between boxes, to improve the adaptivity of the method 
(which is very important in highly inhomogeneous situations) and to 
permit the smoothing of gravitational interactions. 

The results of our tests indicate that the tree-code is almost three times 
faster than FMA for both a homogeneous and a clumped distribution, at 
least in the interval of N (N < 10 5 ) here investigated and at the same level 
of accuracy (error ~ 10~ 3 ). This order of accuracy is generally considered 
as the best compromise between CPU-time consumption and precision for 
astrophysical simulation. Moreover, the claimed linear dependence on N 
of the CPU-time of FMA is not confirmed and we give a "theoretical" 
explanation for that. 

1. Introduction 

The availability of fast computers is allowing a huge development of simulations 
of large A-body systems in Astrophysics, as well as in other fields of physics where 
the behaviour of large systems of particles have to be investigated. 

The heaviest computational part of a dynamical simulation of such systems 
(composed by point masses and/or smoothed particles representing a gas) is the 
evaluation of the long-range force, such as the gravitational one, acting on every 
particle and due to all the other particles of the system. Astrophysically realistic 
simulations require very large N (greater than 10 5 ), making the direct ~ N 2 pair 
gravitational interaction evaluations too slow to perform. To overcome this problem 
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various approximated techniques to compute gravitational interactions have been 
proposed. 

Among them, the tree-code algorithm proposed by Barnes & Hut (hereafter B.H., see 
[]) is now widely used in Astrophysics because it does not require any spatial fixed 
grid (like, for example, methods based on the solution of Poisson's equation). This 
makes it apt to follow very inhomogeneous and variable in time situations, typical of 
self-gravitating systems out of equilibrium. In fact its intrinsic capability to give a 
rapid evaluation of forces allows to dedicate more CPU-time to follow fast dynamical 
evolution, contrarily to other higher accuracy methods that are more suitable in other 
physical situations, e.g. molecular dynamics for polar fluids, where Coulomb term is 
present. 

While for the tree-code the CPU-time requirement scales as N log 8 N, in the 
recently proposed Fast Multipole Algorithm (hereafter FMA, see ||) this time is 
claimed to scale as N, at least in quasi-homogeneous 2-D particle distributions. Were 
this linear behaviour confirmed in 3-D highly non-uniform cases, FMA would really 
be appealing for use in astrophysical simulations. 

In this paper, we compare CPU times of our own implementations of adaptive 3-D 
FMA and tree-code to evaluate gravitational forces among N particles in two rather 
different (uniform and clumped) spatial configurations. 

Detailed descriptions of tree-code and FMA can be found in [], || and p 
respectively. For the purpose of this paper (the performance comparison of the two 
above mentioned methods in astrophysically realistic situations) we built our own 
computer versions of tree-code and FMA. Our tree-code was written following at 
most the prescription but for the short-range component of the interaction force 
(see P|), while our FMA is slightly different from the original proposed by Greengard 
||, to make it more efficient in non-uniform situations where adaptivity is important, 
and to include the smoothing of interactions which is quite useful in astrophysical 
simulations, as we will describe in Sect. || 

In Sect. 2 and 3 we briefly review the algorithms and give some details of our 
implementations, in particular for the FMA, while in Sect. 4 the comparison of FMA 
and tree-code CPU-time performances is presented and discussed. 
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2. 



The tree— code 



We have implementated our own version of the "classic" BH tree-code (see [] or 
0) which is well described in || (see also 0); we will give here only a brief discussion 
of the improvements we have introduced. 

In the tree-code the cubic volume of side I that encloses all the particles is 
subdivided in 8 cubic boxes (in 3-D) and each of them is further subdivided in 8 
children boxes and so on. The subdivision goes on recursively until the smallest boxes 
(the so called terminal boxes) have only one particle inside. Moreover the subdivision 
is local and adaptive in the sense that it is locally as more refined as the density is 
higher. The logical internal representation of this picture is a "tree data structure", 
from that the denomination tree-code. 

The gravitational force on a given particle in r is then computed considering 
the contribution of the "clusters" of particles contained in boxes that are sufficiently 
distant from the particle, that is in those boxes which satisfy the open-angle criterion: 



with d the distance between the particle and the center of mass of the cluster, 
4n — i/2 m the box size at level m of refinement and 9 > a parameter a priori 
fixed. This contribution is evaluated by means of a truncated multipole expansion 
that permits to represent the set of n particles contained in the box with a unique 
"entity" identified by a relatively low number of attributes (the total mass, the center 
of mass, the quadrupole moment..., that is the set of multipole coefficients), that are 
stored, in a previous step, in the tree data structure. In this way the evaluation of 
interactions is speeded-up compared to a direct pair-to-pair evaluation. Usually in 
tree-codes second order expansions are used, that is up to the quadrupole term, this 
approximation being sufficient in typical astrophysical simulations. Those boxes that 
do not satisfy the condition (|]), will be "opened" and their children boxes will be 
considered. This "tree descending" continues until one reaches a terminal box whose 
contribution to the field will be calculated directly. 

The parameter 6 and the order of truncation of the expansion, permit to have 
some control of the error made in the evaluation of the field in the generic particle 
position. With the second order of approximation we used and with 9 in the interval 
[0.7, 1] one obtaines a relative error less then about 1%, as we will see in the following. 

In our version of the tree-code we considered a gravitational smoothing. This 
means that each particle is represented by a /3-spline (a polinomial function, with 
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a compact support, differential) le up to the second order) that gives a Newtonian 
potential outside the sphere centered at the position of the particle and of radius 
2e, while inside that sphere the field is smoothed; see for example J7[ and ||. This 
smoothing avoids divergence in the accelerations during too close approaches between 
particles whose trajectories would be not well integrated in time by the usual 
numerical procedures (like, for example, the leap-frog scheme). In the particular case 
of our static test, i.e. without considering any time evolution, the smoothing would 
only be an unnecessary complication; consequently in these tests we fix e = for every 
particle (exactly Newtonian potential). 

However because of the presence, in general, of a smoothing of the field we have 
modified the open-angle criteria in the following way: the box of size I with center 
of mass at R is "sufficiently distant" from the particle in r, i.e. its potential can be 
expandend in a multipole series, if 

|R-r| > max{£/fl,2e} (2) 

where 2e is the gravitational smoothing "radius" of the particle in r. 

3. The Fast Multipole Algorithm 

In a way similar to the tree-code, the FMA is based on an approximation of the 
gravitational field produced by a set of sufficiently distant particles over a generic 
particle. It uses the same logical "octal tree-structure" of the hierarchical subdivision 
of the space as the tree-code, with the only difference that instead of stopping 
subdivision when boxes contain single particles, in this case the recursive subdivision 
ends up at boxes containing no more than a fixed number s > 1 of particles. This 
to reduce the CPU-time required for the calculation. We still call terminal boxes the 
boxes located at the "leaves" of the logical tree-structure. Also the FMA uses the 
truncated multipole expansion, but in a different and more complicated form. The 
advantage is that of a more rigorouse control of the truncation error. 

In general the spherical harmonics expansion works for particles interacting 
via a potential $(r, r') oc l/\r — r'|, so, even if we will always speak in terms of 
gravitational field the treatment is the same in the case of electrostatic field. The only 
differences are the coupling constant and the attribute of each particle (the mass in 
the gravitational case, the charge with its sign in the electrostatic case). 

The FMA takes advantage of that if we know the coefficients of the multipole 
expansion that gives the potential in a point P, we are able to build a Taylor 
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expansion — the so called local expansion — of the potential in a neighbourhood of P 
(see Appendix 0). This expansion is used to evaluate the acceleration of a particle 
near P, instead of re-evaluating the multipole expansions of the field due to all 
collections of distant particles as it happens with the tree-code. 

Furthermore, besides the interaction particle-particle and particle-box, FMA 
considers also an interaction box-box which is estimated by a truncated multipolar 
expansion, if and only if the boxes are sufficiently distant one each other, that is if 
they satisfy the so called well-separation criterion (which substitutes the open-angle 
criterion of the tree-code). Boxes which are not well-separated, will be "opened" and 
their children boxes considered. In this way one can control the error introduced by 
the truncation of the multipolar expansion, having this error, as we will see, a well 
defined upper bound. 

In fact we know that, if we have a set of k particles at Pj = (/^,atj, (we will 
use spherical coordinates) having masses m^, which is enclosed in a sphere A with 
center in the origin and radius a, and another set of I particles at Q t = (r*, 0$, 0$) 
enclosed in a sphere B centered in Qq with radius b, then the approximated potential 
generated by the set of particles in A at the position of the z-th particle in B, is given 
by a multipole expansion truncated at the order p: 

P n ]\/f m 

W^-gE E -^n^M- (3) 

n=0 m=—n '"i 

The functions Y™{9,(j)) are the spherical harmonics and 

C^E^C^A)- ( 4 ) 

1=1 

Now one can show (see H) that, for any p > 1, 

A / n \ P+ 1 

mQi)-*p{Qi)\<^-(-) , (5) 

where $ is the exact potential and A = GY%=i m i- This result gives the possibility to 
limit the error introduced by approximating $ with the expression (|3]), if the two sets 
of particles are sufficiently distant. 

Let d be the distance between the centers of the two spheres and let us define 
a = d — a — b the separation between them. Obviously the set of particles in B is such 
that r i > a + o for any i = 1, I. Then from we find: 

A$ = max {!«&(<&) - <£ p (Q,)l} < - f-^)^ ■ (6) 
i=i,...,z i. > a \o~ + a) 
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The maximum error of the truncated multipole expansion depends on two quantities: 
the order of the expansion p and the separation a. 

In his original algorithm, Greengard introduced the concept of well-separation 
between two boxes of the same level of refinement, i.e. with the same size £, to control 
the truncation error. Suppose to have two sets of particles in two boxes circumscribed 
by two spheres A and B, with equal radii a = b = i^/3/2: these two boxes are 
well-separated if, and only if, their separation is such that a > a, i.e. d > 3a. In this 
way the upper bound on the error will be, from eq. (H), 

A /l\ p+1 

A*<-(i) • (7) 



a \2, 

To implement this criterion, Greengard imposed simply that: two boxes of the same 
level are well-separated if, and only if, there are at least two layers of boxes of that 
level between them, in such a way to satisfy the inequality d > 3a and finally the ([I]). 

Note that Greengard's well-separation criterion refers exclusively to boxes at the 
same level of refinement, limiting the efficiency of the algorithm especially in the case 
of non-uniform distributions of particles. 

We have modified this criterion to make it more efficient (in the adaptive form of 
the algorithm) to face astrophysical typical distributions very far from uniformity. 
Let us briefly explain our modification. 

First of all, every box is associated to a sphere that is not merely the sphere 
circumscribing the box but the smallest one containing all its particle. So, for terminal 
boxes, is the smallest sphere concentric to the box and that contains all its particles 
and for a non-terminal box of size £, the sphere is also concentric to the box but the 
radius r is calculated recursively, knowing the radii r, (i = 1,...,8) of the spheres 
associated to each of its children unempty boxes, via the formula 

r = max {rA + iV3/2. (8) 

t=l,...,8 

This sphere contains all the spheres associated to the children boxesQ . The value of 
the radius of these spheres is stored in the "tree" data structure together with the 
other data pertinent to the box and it is much more representative of the "size" of 
the set of particles in the box, than the mere box size £, for controlling the truncation 
error. 

Now consider the box A centered in the origin O and containing the collection of k 



lr This is to preserve the same error bound as in eq. (|^), for the truncated expansions obtained 
translating and composing the local and the multipole expansions as discussed in Appendix |A| 



Fig. 1. — Example of two well-separated boxes, A and B (see text). A and B are the 
associated spheres enclosing the two set of particles. 

particles belonging to the sphere A seen before. Let this sphere A be its associated 
sphere with radius a (see Fig. |l]). Let the other set of particles at Q i} i = 1, be 
enclosed in the box B whose associated sphere be the sphere B centered in Q and 
with radius b. The radii a, b have been evaluated by the (||). Let us suppose that the 
separation of the two spheres is such that a > S ■ a, with 5 > a fixed parameter (see 
again Fig.|l|). Obviously the set of particles in B is such that, from eq. (§), 



and we can note that the parameter 5 works in a way similar to 6 in the tree-code. 

Hence, our criterion of well-separation is the following: once we fixed the 
parameter 5, we define two boxes as well-separated, if the distance between their 
centers al is such that 



that is if the separation between their associated spheres is a > 5 ■ a. In this way the 
evaluation of the interaction between the sets of particles contained into such boxes 
will be affected by a truncation error which is bounded by the (|9|). 

The implementation of this criterion is very easy because it consists just in 
verifying the (|10|) where the radii of the spheres associated to the boxes have been 




(9) 



d>a + b + 5-a, 



(10) 
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already calculated and stored, together with all the other data of each box, in the 
phase of the algorithm in which the tree-structure is built. 

Our criterion of well-separation is more efficient than the Greengard's original 
one, having the following features: 

• it depends upon the internal distribution of particles in the box (via the radius 
of the associated sphere); 

• it can involve boxes of different level of refinement, having different sizes, so 
to improve efficiency in non-uniform situations and to make unnecessary those 
complicated tricks conceived to shorten the list of the well-separated boxes like, 
for example, the mechanism of "parental conversion", (see ||); 

• it can be tuned in such a way to obtain the desired accuracy, via the parameter 

• it can be easily modified in order to allow a gravitational smoothing to be applied 
to the particles, as we will see below. 

Note also, see the (f7|), that to control the truncation error, Greengard varied 
p according to the desired accuracy. In our version we have, instead, fixed p = 2 
and achieved varied the parameter 5 in order to obtain the same accuracy that is 
usually obtained with the tree-code in astrophysical simulations. One has to care 
with keeping reasonably low the execution CPU-time (be small compared with 
human time scale!) and this is obtained at a price of a certain loss in accuracy in the 
evaluation of interactions. In fact the main characteristics of astrophysical simulations 
of gravitating systems are: 

1. they require a great number of particles (usually more than 10 4 ) for a rather 
large duration of the simulation^] and, in particular, 

2. due to the intrinsic instability they offer a very wide distribution of time scales. 

So while simulating polar fluids in molecular dynamics, one has to face 
"microscopic" time scales more narrowly distributed (just because molecules tend to 



2 several dynamical times: that is many times the typical time scale of the entire system, such as 
the sound crossing time for a collisional system or the core-crossing time for a collisionless one 
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repulse each other due to the presence of short-range interactions, like the Lennard- 
Jones potential) and one can work with expansions truncated up to eighth order or 
more (although the CPU-time grows as p 4 , see |5[]), in astrophysical simulations one 
prefers to limit the precision at a lower but reasonable level and, on the other hand, to 
be able to process systems which are highly dynamical. Therefore one usually works 
with expansions truncated up to the second order (in some cases even the first order) 
that, in the tree-code, corresponds to consider up to the quadrupole moment. 

For the mass density of the single particle we used the same (3 -spline profile as 
we did in our tree-code. In this way the potential is exactly Newtonian outside the 
sphere of radius^ 2ej centered in r i; so it can be expanded in multipole series only in 
this region. Hence the interaction with a particle inside the sphere of radius 2ej must 
be necessarily evaluated by means of a direct summation. This requirement, which 
would be very difficult to incorporate in Greengard's criterium, has been considered 
via a little arrangement of our well-separation criterium (|l0l), that is: 
two boxes (A and B) are well-separated if, and only if, their spheres (A, with radius a 
and B with radius b respectively) are such that: 

d > a + b + max{<5 • a, 2e^} (11) 

where d is the distance between the centers of the two spheres. 

The smoothing length used for the box A is another quantity stored in the 
tree data structure and it is given by this simple recursive scheme: if A is terminal, 
then = maxjjej}, where is the smoothing lenght of the particle i contained in A, 
otherwise = max^je^}, with the maximum taken over all the unempty children 
boxes C of A. Anyway in the following comparison tests, we let e = (as for the 
tree-code) because we are not interested in the dynamical evolution of the system. 

For a more detailed description of our own implementaton of FMA see Appendixes 
[A] and fj. 

4. Codes performance comparison 

To compare the CPU-time spent by the two algorithms described in the 
previous Sections, we ran the codes on a IBM R6000 workstation with two different 



3 In principle each particle is allowed to have its own smoothing length. This is useful when one 
has to simulate gravitational interactions between both collisionless and collisional particles (i.e. fluid 
elements). 
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distributions of particles. In the first case N particles have been distributed a 
uniformly at random in a sphere of unitary radius. In the second case a set of N 
particles has been distributed, with a Monte-Carlo method, in a unitary sphere in 
such a way to discretize the density profile: 



with r c = 0.2 (obviously p = for r > 1). This latter is known as Schuster's 
profile; it corresponds to a polytropic sphere (of index 5) at equilibrium (see []) and 
represents a good approximation to the density distribution of various stellar systems. 
In both the uniform and clumped case all the particles are assumed to have the same 
mass. 

The order of accuracy chosen was the same for both the tree-code and the 
FMA. For accuracy we mean how close, in modulus, the evaluated forces are to those 
calculated "exactly" by a direct, Particle-Particle (PP) method, which is affected 
only by the numerical error of the computer (due to the finite number of digits). 
Consequently we define as relative error of the calculation e: 

1 n | PP\ 

where Oj is the modulus of the acceleration of the i-th particle estimated by each of 
the two algorithms and af p that computed by the PP method. The error on the 
direction of the forces is much lower then the error on the modulus we have defined 
above, and, as it is usual in N-body numerical method, it is not considered at all in 
performance tests being negligible. 

Figure |2| gives the relative error of the tree-code and of the FMA in the uniform 
and "clumped" case. The error is almost the same for both the algorithms. An 
averaged (on all the particles) relative error less than 1% (this is the order of 
magnitude of the error generally admitted in astrophysical simulations), is obtained 
fixing 

9 = 0.7 for the tree-code, (14) 
S = 2.5 for the FMA, (15) 

and considering, as we have said, up to the quadrupole term in the tree-code and to 
the same order term in the multipole expansion (p = 2) of the FMA. Moreover we 
chose s = 10 in the FMA as the best compromise between accuracy and computational 
speed, as we checked. 
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Fig. 2. — Averaged relative error vs. number of particles for both algorithms. 

The CPU-time spent to calculate the accelerations vs. the number N of particles 
is shown in Fig.|3[ The CPU-time for the Particle-Particle method is also shown as 
reference. Both the algorithms are slower to compute forces in the non-uniform model 
than in the uniform one, and for the tree-code this is more evident. This is clearly 
due to the more complicated and non-uniform spatial subdivision in boxes that affects 
mostly the tree-code due to the finer and deeper subdivision of the space it uses. 
Anyway the tree-code shows to be faster than the FMA for both the distributions and 
for iV varying in the range we tested. As expected, the behaviour of the CPU-time 
vs. N for the tree-code is well fitted by the logaritmic law 

t cpu = aNlogN + (3, (16) 

where a and f3 are given in Table |I[ In our opinion, this law must be followed by the 



FMA too, as we will explain in Section 4.1 



However, let us observe the CPU-time for the uniform case: it is not easy to 
distinguish at a first sight a logarithmic behaviour from a linear one, furthermore 
we can presume, as we can observe in Figure [3], that the FMA must show a more 
complicated behaviour due to the presence of the parameter s (the maximum number 
of particles leaved in terminal boxes) and to that s was kept fixed. Hence for a 
certain range of iV this s could have been chosen as optimum, but could have not 
been so for other values of N (in those ranges in which the CPU-time shows to grow 
excessively with a slope larger than that of the tree-code and comparable to that of 
the PP method). Blelloch and Narlikar [|IJ have obtained similar "undulations" in the 
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Fig. 3. — CPU-time (on an IBM R6000 machine) vs. number of particles for both the 
algorithms and for the direct Particle-Particle method. 

behaviour of the CPU-time of their version of FMA, while the behaviour of tree-code 
was much "cleaner" , as in our tests. 

Coming back to the accuracy, note that for iV > 8192, the rel. error of the FMA 
exceeds that of the tree-code (see Fig.^). In the same range of N the CPU-time of 
the FMA grows less rapidly than that of the tree-code. This follows the obvious gross 
rule by which the faster the calculation the lower the precision in the results. So for 
N > 8192 the value s — 10 fixed is likely too small and consequently the percentage 
of direct (PP) calculation is decreased, so that FMA loses precision. On the contrary, 
for N < 4096 we observe, especially in the Schuster case, that Efma < £tc\ m the same 
region the CPU-time for FMA grows more rapidly than for the tree-code, so in this 
case s seems to be too high. 



Uniform 



Algorithm 


a 


P 


tree-code 
FMA 


1.1 • 10 -3 
4- 10~ 3 


-1.1 
-7.4 



Schuster 



Algorithm 


a 


P 


tree-code 
FMA 


1.6 • 10~ 3 
4.9 • 10~ 3 


-1.7 
-10 



Table 1: Values obtained for the parameter, fitting the behaviour of the CPU-time for 
both algorithms with the law: t cpu = aiVlog 8 N + (3. 



- 14 - 



It is interesting to note that the relative error of the tree-code shows a decrease 
at increasing the number of particles. This is probably due to the large fraction of 
direct, particle-particle calculations, because of the growing density of particles (the 
volume occupied by the system remains the same). This explains also the lower error 
in the clumped case than in the uniform one at a given N, because in the Schuster's 
profile the central density of particles is clearly higher than in the uniform case. 

Finally we can see that our FMA becomes faster than the PP method for 
N > 18, 000 in the Schuster's model, and for iV > 16, 000 in the uniform case. 
To make a comparison with codes of other authors, let us consider the FMA 
implemented in 3-D by Schmidt and Lee |TtJ (although this code is not adaptive) 
following the original Greengard's algorithm and, in particular, his well-separation 
criterion. Their FMA is vectorized and it runs on a CRAY Y-MP, but anyway we do 
not make a direct comparison, but rather compare our "CPU-time ratio" (that is the 
ratio between the CPU-time consumed by FMA and that consumed by the direct PP 
method), with the same ratio as obtained by the codes of Schmidt and Lee, at the 
same order of magnitude of the error on the forces. 

So we can note (see |TIJ) that for N = 10 4 uniformly distributed particles, a 
truncation of eighth order (p = 8) and five levels of refinement (a priori chosen), they 
obtain a CPU-time of 418 sec. for their FMA, against 11 sec. spent by the direct PP 
method, with a relative error on the forces of about 1.4 • 10~ 3 . Thus they obtain a 
ratio t { ™A) j t {PP) „ 40; while with our codes we see that for the uniform distribution 
with iV = 10 4 , we have an error on the forces that is ~ 3 • 10~ 3 (see Fig.Q) with a 
CPU-time ratio tg^^/tffl ~ 1, i.e. about the same CPU-time spent by the PP 
method (see Fig.|3|). 

This seems a very good result. 



4.1. Scaling of CPU— times versus N 

Greengard and other authors (H, [0,0) assert that FMA would exhibit a linear 
scaling of CPU-time vs. N. We tried to fit v^^ A ' as function of N, with various 
laws, the linear included. The result is that a logarithmic behaviour like that of eq. 
(IHf), gives the best fit, as for the tree-code. 

The values obtained for a and (3 in Table [I] show that ctFMAlcttc ~ 3: the 
tree-code is roughly three times faster than the FMA. Note how this difference of 
performances reduces slightly passing to the Schuster clumped profile; this because, 
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as we have said, our FMA adaptive code is less sensitive to the degree of uniformity 
of the distribution of the particles than the tree-code which uses a finer subdivision 
of the space in boxes (as it corresponds to s = 1). 

The higher speed of the tree-code respect to the FMA is easily understood, at 
least in 3-D, since while in theory the FMA is more efficient and less "redundant" in 
managing informations (remember the use of Taylor expansion of the potential on 
near bodies), in practice this "potential" greater efficiency pays the price of a certain 
quantity of computational "complications" . This carries the method to a negative 
total balance in terms of speed respect to the competing tree-code. 

How can we interpret the CPU-time scaling? 
The logarithmic behaviour of the tree-code is explained by a simple estimate of the 
number of operations needed by the various steps of the algorithm (see e.g. [],|| and 
H). It is roughly given by the product between the number of particles N by the 
number of "bodies" (boxes or particles), about log 8 iV, which contribute to the force 
on each particle. Hence t^l ~ Nlog 8 N. 

The logarithmic behaviour of the FMA can be similarly understood when one 
reconsiders carefully the cost of each step of the algorithm. 
It can be estimated (see f|) that 

~ aN + bN ter + cN ne , (17) 

that is the CPU-time is a linear function of N, N ter (the number of all terminal 
boxes) and N ne (the number of all non-empty boxes). Greengard B in his final 
considerations on the scaling of the FMA in the adaptive 2-D version (the 3-D case is 
similar), estimates the number of this types of boxes (see lemmas 2.6.4 and 2.6.5 in 
§) to be 

N 

N ter ~ 4-L-- (18) 
s 

N 

N ne ~ 5-L-- (19) 

s 

where L is the total number of subdivisions needed to reach terminal boxes. This L 
is identified by Greengard with L ~ log 2 (l/A), where A, a priori fixed, is the spatial 
resolution that one wants to reach in the simulation. Being A fixed, L would be 
independent of N, thus N ter and N ne , in eq. ([17]), are quantities linear in N. Then the 
FMA CPU-time estimated by the eq. ( |17D results linear in iV too. 

The crucial point is that a constant A (and L) allows to manipulate only those 
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distribution of particles such that A < r min , with r min the minimum distance between 
a pair of particles (see observation 2.5.1 in 

Obviously, in 3-D, r min ~ iV~ 1//3 , so that A < r min implies 

L>ilog 2 iV = log 8 iV (20) 

Substituting this inequality in the expressions ( p^) and fll9|) for N ne and i\T ter , one 
obtains from eq. QT7]) that igf A) > iV log 8 JV, that is the same "natural" scaling of 
the tree-code. 



5. Conclusions 

Realistic astrophysical simulations are characterized by the large amount of 
computations required by the evaluation of gravitational forces. Many codes which 
give performant approximation of the force field have been proposed in the literature. 
In this paper we compare two of these codes: one (the BH tree-code) has been largely 
used in astrophysics for ten years, the other (the Greengard's FMA ||) has given 
promising results in the field of molecular dynamics. 

We have so implemented our own optimized serial versions of both the tree-code and 
adaptive, 3-D, FMA. 

The results of our comparison tests indicate the tree-code as faster than FMA 
over all the interval of total number of particles (N < 10 5 ) allowed by the central 
memory capacity of a "typical" workstation. This maximum value of N, which 
could appear low respect to modern parallel simulations (up to AT ~ 10 7 ), is anyway 
meaningful because even fully parallel codes are limited by an individual processor 
charge of that order of magnitude. The problems in the parallelization of the two codes 
are comparable due to their similar structure, thus the higher speed of the tree-code, 
here verified in a serial context, should be confirmed in the parallel implementation 
and it seems a valid reason to concentrate efforts for the most efficient parallelization 
of the tree-code. 

At the end of this paper we have discussed the dependence of the FMA CPU-time 
on A" and given explanation of why its behaviour is similar to that of the classic 
tree-code. 
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A. Manipulation of Multipole Expansions in the FMA 

Here we briefly describe the three theorems, due to Greengard [[J, that permits 
the manipulation, in 3-D, of the various series expansions used in the algorithm, and 
that are useful for the deeper and formal description of our own implementation that 
follows in the next Appendix [B| 

We have said that one can "transform" the multipole expansion (|3|) into a local 
expansion useful to evaluate the field about a given point P. More precisely, given 
the set of k particles at Pi in the sphere A (associated to the box A, see the Fig. 
[1]) which produce the gravitational potential over the set of particles in the 

sphere B (box B) with center Q , then one can show that in the vicinity of Qq the 
approximated potential 

9p(Q<) = " G EE Ljl?(M)(r|)', (Al) 

3=0 k=-j 

where Qi — Qo = (r^, 0^,$), differs from the exact $(Q;,) of an amount bounded by 
the same expression that appears in the r.h.s. of the eq. (|]). In this truncated local 
expansion the coefficients are given by: 

p n 

l ) = y: e s^M^Y^i^mpi)-^- 1 , (A2) 

n=0 m=— n 

M™ being the same that appears in the expression (|3|) and a matrix of coefficients 
(see Appendix 

Another theorem allows us to calculate M™ in a recursive manner. Let us 
consider the partition of the set of k particles in the box A, in the q < 8 sub-sets each 
of them enclosed in the spheres associated to the children boxes of A and with centers 
in (pW,a (1) ,/? (1) ),..., (p (q \a iq \(3 {q) ). Let M™ (1) , M™ (2) , M™ (<?) be the multipole 
coefficients calculated for each of the sub-sets of particles. If all these spheres are 
enclosed in the sphere A (this is automatically satisfied because of the eq. @), the 
coefficients Mj given by: 

j n 

Mf = EE Sfl n ^zf q W~ m {a^\^W q) T, (A3) 

n=0 m=—n 

(see the Appendix |C| for the matrix ) give a potential 

P n A/f m 

W = -GE E t^C(M) (A4) 

n=0 m=—n 
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(where P = (r, 9, 0) is a generic point outside the sphere A), which well approximates 
the exact potential $(-P) generated by all the k particles in the sphere A. In fact if P 
is the position of a particle in the well-separated box B, then one can show that: 

the truncation error has the same upper bound given by @. 

So we can compute associated to the box A containing the total set of 
particles knowing only those pertinent to the q sub-sets in each children box. In 
the tree-code the same happens, but there the analogous theorem — the "quadrupole 
composition theorem" — has been developed specifically for the quadrupole moment by 
Goldstein || (those for the monopole and the dipole are obvious). In the FMA this 
theorem works for coefficients of any order and it has an upper error bound. 

Thus, once M™ have been calculated for all terminal boxes using the definition 
(|), by means of ( \K3j ) we can compute recursively the coefficients of parent boxes 
ascending the tree-structure. This coefficients will be transformed, when needed, 
in the local expansion coefficients (as we will see in more details in Appendix |B|) 
necessary to calculate forces by ( |A1| ). In this way, we will be sure that the error made 
in approximating the "true" potential with the various expansions, will always be 
bounded by the (|A5|). 

The last theorem concerns with the translation and composition of the local 
expansion coefficients (briefly Taylor coefficients) . In this case the rules of composition 
works, in a certain sense, inversely. That is, given the coefficients relative to the 
set of k particles in the sphere A, such that the potential ^ P (P) (P = (r, 9, 0) is a 
point inside A) given by the truncated local expansion about the origin O: 

W = E L)Y*{9^)r\ (A6) 

jf=0 k=-j 

is such that 

i* p( P)-*(P)i<^(_) • < A7 > 

then at the same point but with another origin Q e A, we have the equality 

%(P) = ~ott L^{9'A'){r'y (A8) 

j=0 k=-j 

where P — Q = (r', 9', <$') and where the new translated coefficients are: 

L «J = £ £ sfl n ^Y™-\a^)p^ (A9) 

n=j m=—n 
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being O — Q = (p, a, (3) (for the matrix see Appendix 

Thus given the L k - for a box, we can compute the Taylor coefficients for all the 
unempty children boxes using the above formula, with the new origin Q at the center 
of each children boxes. The process has to be recursively iterated until we reach 
terminal boxes. But how can we obtain the coefficients of a box B "the first time" 
(not knowing those of its parent box)? Obviously they will be calculated by means of 
the (|A2|) , transforming the multipole coefficients of sufficiently distant boxes. That is 
of boxes that are well- separated from B. 

B. Formal description of the FMA algorithm 

Here we describe in deeper detail our version of the FMA algorithm that is sligthly 
different from the original Greengard's algorithm in the adaptive implementation. 
The differences regard mainly the way interactions between distant boxes and the 
set of particles in a terminal box are calculated. Moreover, as we have said, we have 
modified the Greengard's well-separation criterion to take into account the presence 
of a smoothing of the interaction that in astrophysical simulations, contrarily to 
molecular dynamics, is unavoidable to include. 

Let us first introduce some useful definitions: in the following s indicates the 
maximum number of particles in the terminal boxes and the "calligraphic" letters 
refer to collections of boxes while simple capitals letters refer to single box. 

• Imax is the maximum level of refinement reached in the space subdivision; 

• Ia indicates the level of box A, whereas the level of the root box R, that is the 
box containing all the particles, is 0; 

• £>(/) is the set of all boxes at level I of refinement; 

• M(A) is the parent box of box A; 

• C(A) is the set of all children boxes of box A; 

• C(S) = Ua£sC(A) is the set of all children boxes of each box in the set S; 

• X(A) indicates the set of boxes, of level ^ or 1^ + 1, that are NOT well separated 
from box A. The set contains all the brothers of box A, but not A itself; 
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• T is the set of terminal boxes, that is such boxes that have no children because 
they contain less than s + 1 particles, so they have not been subdivided; 

• ha, with A G T, is the number of particles inside the terminal boxes A 
(obviously ua < s); 

• oIab represents the distance between the geometrical centers of boxes A and B; 

• €a is the length of the gravitational smoothing relative to the box A; 

• ta is the radius of the sphere that contains all the particles in the box A and 
that is concentric to it (see text). 

The notation 'do n — a, ft (with b > a integers) means that all passages included 
between this statement and the correspondent 'end do', are repeated b — a + 1 times 
and every time the integer variable n takes the values: a, a + 1, b — l,b, like in the 
Fortran, while the notation 'do A G <S' means, in this case, that every time the loop is 
executed the box A represents one of the various boxes in the set S. So the statements 
between 'do' and the related 'end do' are repeated Card{5} times and each time with 
a different box A G S. For example, if S = {Ai, A 2 , A 3 , A n }, the box A is Ai the 
first time the loop is executed, A 2 , the second time and so on. However, the order 
the boxes have in the set S has no importance in the algorithm. On the contrary 
in the first case of 'do ... end do', the order in the values that n takes everytime is 
important. Another notation is 'do while condition 1 , meaning that it will be executed 
the statements between this 'do while ...' and the correspondent 'end do', while the 
logical condition keeps true. 

Calculate recursively the multipole coefficients for all the boxes of each level, starting 
from the terminal boxes. This procedure is the same that in the tree-code, but with the 
difference that in the FMA the multipole expansion is calculated with the origin in the 
geometrical center of the boxes, so the first order coefficient (the dipole moment) does 
not vanish. Another complication is that in the FMA this coefficients are necessarily 
complex quantities. For terminal boxes use the eq. ([|), while for the others use eq. 

Calculate the radius r of the sphere that contains all the particles in each box. If a 
box is terminal then r = maxjijr-j — R|}, where r-j is the position of the particle i in the 
box and R is its center. If it is not a terminal box the radius is calculated by means of 

(§■ 

let X(R) = 

do I — 1, l m ax 
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do C € B(l) 
let X{C) = 

if / > 1 then translate, if they exist, the coefficients of the local expansion of the 
parent box M(C) about the center of box C using eq. (|A9|) . 

if C T then [the box C isn't terminal] 

doBe X(M(C)) 

if dsc > max{2e#, 5 ■ r B } + r B + then 

convert the multipole coefficients of box B to Taylor coefficients about 
the center of box C with eq. (|A2|) , because it is weZZ separated from C 
and the sphere where the field generated by the masses in B is smoothed, 
does not intersect the sphere associated to C. Sum the Taylor coefficients 
to the pre-existent ones. 

else 

do B x G C(B) 

if d Bl c > max{2e Bl , 5 ■ r Bl } + r Bl + r c then 

convert the multipole coefficients of box B\ to Taylor coefficients 
of box C with eq. (|A^), because it is well separated from C and 
the the sphere where the field generated by the masses in B\ is 
smoothed, does not intersect the sphere associated to C. Sum the 
Taylor coeff. to the pre-existent ones. 

else 

put B\ into the collection X{C). 
end if 
end do 
end if 
end do 
else [the box C is terminal 
let A = X(M(C)) 
do while A not empty 
do B G A 
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if B G T then [the box B is terminal 
eliminate B from the set A 

do i = 1, nc* 
do j = l,n B 

Sum directly (i.e. without any expansion) to the grav. field 
on the particle i that due to the particle j taking into account 
the grav. smoothing 

end do 

end do 

else [the box B isn't terminal] 

if oIbc > max{2eB,<5 • rs] + rs + rc then [the box C is well-sep. 
from the box B and it is outside its smoothing sphere] 

eliminate B from the set A 

Sum to the Taylor coefficients of the box C those obtained 
transforming the multipole coefficients of the box B by means of 

end if 
end if 
end do 

let A = C(A) [Now indicate with A the collection of all the children boxes 
of each box in the precedent set A. This means that we are descending the 
tree to the next level] 

end do 

do i = 1, nc 

Calculate the local expansion of the grav. field in the position of the 



particle i, using the (A.1) and the coefficients pertinent to the box C, 



summing to the accelerations calculated up to now. 
end do 
end if 



end do 
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end do 

Note that we have simplified the way forces on the particles in terminal 
boxes are evaluated. In Greengard's adaptive algorithm this is made by means of 
complicated passages and classifications of boxes into many several collections that 
are computationally expensive to build up. 

In our opinion this complication is unnecessary, because when one has to consider 
a terminal box for wich one has the long-range component of the potential in terms of 
Taylor coefficients (translated from those of its parent box), one has only to calculate 
the short-range forces on the n particles (with n < s) inside the terminal box, due to 
a certain set of near boxes and this can be done in the most efficient way by means 
of the same kind of passages that in the tree-code are used to evaluate the force on a 
single particle. 

Suppose we have to evaluate forces on tlq particles in the terminal box C. 
When we deal with a non-terminal box B and this box is not well-separated from 
C, then it will be subdivided considering its children boxes and the subdivision is 
recursively repeated until we reach either terminal or well-separated boxes. The 
contribution due to terminal boxes will be calculated directly, that is summing 
particle-particle interactions. The contibution due to well-separated boxes will be 
evaluated converting their multipole expansion coefficients into Taylor ones, summing 
them to the pre-existent cofficients of the box C and then, in a following passage, 
using these coefficients and the Taylor expansion to evaluate gravitational forces at 
the points occupied by the particle in C. This is done in the last statements of the 
above description (from the 'do while forward). 



C. The matrices of coefficients 

Defining A) = [(j - k)\(j + k)\}~ 1/2 , we have: 

q(1) — r>m Am \k i \m-k ( r^-i \ 

D j,fe,n,m — n k,n n -n n -jl n -j+n 

q(2) _ /~ik—m Am Ak—m I Ak /no\ 

r<(3) _ jjm Am-k Ak 

■ > j,k,n,m — ^n-j^-k^n-j 



rt(3) _ n m Am-k Ak I Am i fio\ 

°j,fc,n,m — u n-i,m-k- n -a-i A j I A n \^°) 



where 



Rm _ / 1V J (-l) min «l} if m.k>0 

= 1 otherwise (C4) 



- 24 - 



Ci = 



D s 



(_]_)min{|r|,|s|} 
1 



= (-ir 



if r • s < 
otherwise 

if m • s < 
if m • s > and |s| < \m\ 
otherwise 



(C5) 
(C6) 
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